BigQuery Parameterization

Google BigQuery Standard SQL supports query parameterization. It is often useful to be able to use Python variables defined in the notebook as parameter values for SQL.

This notebook is an example of how to use parameterized queries.

Data Preview


In [2]:
%%bq query -n logs_query
SELECT * FROM `cloud-datalab-samples.httplogs.logs_20140615`

In [3]:
%bq sample -q logs_query --count 10


Out[3]:
timestamp                    latency  status  method  endpoint
2014-06-15 07:00:01.000102   124      405     GET     Interact2
2014-06-15 07:00:00.652760   28       405     GET     Interact2
2014-06-15 07:00:00.834251   121      405     GET     Interact2
2014-06-15 07:00:00.943075   28       200     GET     Other
2014-06-15 07:00:01.159701   119      200     GET     Other
2014-06-15 07:00:00.536486   48       200     GET     Interact3
2014-06-15 07:00:00.003772   122      200     GET     Interact3
2014-06-15 07:00:01.071107   49       200     GET     Interact3
2014-06-15 07:00:00.670100   103      200     GET     Interact3
2014-06-15 07:00:00.428897   144      200     GET     Interact3

(rows: 10, time: 1.9s, 24MB processed, job: job_PQaM8KiFmTkJeZAMZ7Pbt4QKX28)

In [6]:
%%bq query
SELECT endpoint FROM `cloud-datalab-samples.httplogs.logs_20140615` GROUP BY endpoint


Out[6]:
endpoint
Admin
Interact1
Home
Warmup
Create
Popular
Interact3
Recent
Interact2
Other

(rows: 10, time: 0.2s, cached, job: job_yZVXYwERTLvext8mayphi5_5x6g)

Parameterization within SQL queries

Parameters are declared in SQL queries using the @name syntax within the SQL, and each name's value is supplied when the query is executed. Note that you have to define the query and execute it in two separate cells. The shorthand way of running queries (%%bq query without --name) gives you little control over how the query is executed.


In [7]:
%%bq query -n endpoint_stats
SELECT *
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = @endpoint
LIMIT 10

In [8]:
%%bq execute -q endpoint_stats
parameters:
- name: endpoint
  type: STRING
  value: Interact2


Out[8]:
timestamp                    latency  status  method  endpoint
2014-06-15 07:00:03.093084   21       302     GET     Interact2
2014-06-15 07:00:03.263381   22       302     GET     Interact2
2014-06-15 07:00:01.000102   124      405     GET     Interact2
2014-06-15 07:00:03.018770   125      405     GET     Interact2
2014-06-15 07:00:00.652760   28       405     GET     Interact2
2014-06-15 07:00:01.456010   123      405     GET     Interact2
2014-06-15 07:00:00.834251   121      405     GET     Interact2
2014-06-15 07:00:01.166912   27       405     GET     Interact2
2014-06-15 07:00:01.694210   29       405     GET     Interact2
2014-06-15 07:00:03.062767   173      200     POST    Interact2

(rows: 10, time: 1.7s, 24MB processed, job: job_pb3EUsKwE1dpQGqa3lzq28-am6I)

This defined a SQL query with a string parameter named endpoint, whose value can be filled in when the query is executed. Let's give it a value in a separate cell:


In [9]:
endpoint_val = 'Interact3'

In order to reference the variable defined above, Google Cloud Datalab offers the $var syntax, which can be used in the magic command:


In [10]:
%%bq execute -q endpoint_stats
parameters:
- name: endpoint
  type: STRING
  value: $endpoint_val


Out[10]:
timestamp                    latency  status  method  endpoint
2014-06-15 07:00:00.536486   48       200     GET     Interact3
2014-06-15 07:00:03.925336   46       200     GET     Interact3
2014-06-15 07:00:00.003772   122      200     GET     Interact3
2014-06-15 07:00:04.348899   47       200     GET     Interact3
2014-06-15 07:00:01.071107   49       200     GET     Interact3
2014-06-15 07:00:05.221120   382      200     GET     Interact3
2014-06-15 07:00:04.337463   297      200     GET     Interact3
2014-06-15 07:00:00.428897   144      200     GET     Interact3
2014-06-15 07:00:00.670100   103      200     GET     Interact3
2014-06-15 07:00:02.015507   363      200     GET     Interact3

(rows: 10, time: 2.2s, 24MB processed, job: job_p5tCdFhq8OxReVfZmntdXH2YIQA)
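
The $var substitution is not limited to string values. As a minimal sketch (not executed in this notebook), and assuming the parameters block accepts other BigQuery scalar types such as INT64 in the same way, a numeric filter can be parameterized too; the min_latency parameter and the 250 threshold below are purely illustrative:

%%bq query -n slow_endpoint_stats
SELECT *
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = @endpoint AND latency > @min_latency
LIMIT 10

min_latency_val = 250

%%bq execute -q slow_endpoint_stats
parameters:
- name: endpoint
  type: STRING
  value: $endpoint_val
- name: min_latency
  type: INT64
  value: $min_latency_val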

This can also be achieved with the Python API instead of the %%bq magic commands. This is how to create and execute a parameterized query using the API:


In [11]:
import google.datalab.bigquery as bq
endpoint_stats2 = bq.Query(sql='''
SELECT *
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = @endpoint
LIMIT 10
''')

endpoint_value = 'Interact3'

query_parameters = [
  {
    'name': 'endpoint',
    'parameterType': {'type': 'STRING'},
    'parameterValue': {'value': endpoint_value}
  }
]

job = endpoint_stats2.execute(query_params=query_parameters)

job.result()


Out[11]:
timestamp                    latency  status  method  endpoint
2014-06-15 07:00:00.536486   48       200     GET     Interact3
2014-06-15 07:00:03.925336   46       200     GET     Interact3
2014-06-15 07:00:00.003772   122      200     GET     Interact3
2014-06-15 07:00:04.348899   47       200     GET     Interact3
2014-06-15 07:00:01.071107   49       200     GET     Interact3
2014-06-15 07:00:05.221120   382      200     GET     Interact3
2014-06-15 07:00:04.337463   297      200     GET     Interact3
2014-06-15 07:00:00.428897   144      200     GET     Interact3
2014-06-15 07:00:00.670100   103      200     GET     Interact3
2014-06-15 07:00:02.015507   363      200     GET     Interact3

(rows: 10, time: 2.2s, 24MB processed, job: job_11dsxMuiczeA72TNlpqSYG-ovJg)
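
Because the parameter values are plain Python objects, the same Query object can be executed repeatedly with different values. The loop below is a minimal sketch that reuses endpoint_stats2 from the cell above and simply collects one result set per endpoint value; the results dictionary is only an illustration:

# Re-run the same parameterized query for several endpoint values.
results = {}
for ep in ['Interact2', 'Interact3']:
  params = [
    {
      'name': 'endpoint',
      'parameterType': {'type': 'STRING'},
      'parameterValue': {'value': ep}
    }
  ]
  # Each execution substitutes a different value for @endpoint.
  results[ep] = endpoint_stats2.execute(query_params=params).result()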

Looking Ahead

Parameterization enables one part of the SQL and Python integration: using values defined in Python code in the notebook and passing them into the query when retrieving data from BigQuery.

The next notebook will cover the other part of the SQL and Python integration: retrieving query results into the notebook for use with Python code.